Abstract

In this paper we present a general solution of the system of linear equations
formed by Guibas and Odlyzko in Th. 3.3 [2]. We derive the probabilities that given
patterns are the first to appear in a random text, the expected waiting time
until one of them is observed, and the expected waiting time until one of them occurs
given that it is known which pattern appears first.

1 Introduction



It is a classical problem in probability theory to study the occurrence of patterns in a
random text formed by independent realizations of letters chosen from a finite alphabet.
Feller in [5] considered randomness and recurrent patterns connected with Bernoulli
trials. Solov'ev in [7] found a formula for the expected waiting time until the appearance
of one pattern. Penney in [5] proposed a coin-flip game in which a coin is tossed repeatedly
until one of the patterns appears; this pattern then wins. A formula for computing the
odds of winning for two competing patterns was discovered by Conway and described by
Gardner pQ. A martingale approach to the study of the occurrence of patterns in repeated
experiments was presented by Li [4].

Results dealing with the occurrences of patterns have been applied in several areas of
information theory, including source coding, code synchronization and randomness testing.
They are also important in molecular biology, in DNA analysis and for gene recognition.

Guibas and Odlyzko in Th. 3.3 [2] formed a system of linear equations which
relates the generating functions of the probabilities that a given pattern occurs before the
others to the generating function of the tail probabilities of the waiting time until some
pattern appears. In this paper we present a general solution of this system (Th. 2.3). It
has a general form, but it allows us to obtain formulas for the probability that a given
pattern precedes the remaining ones and a new formula for the expected waiting time
until some pattern occurs in a random text (Cor. 2.4 and 2.5). For completeness of
our presentation we recall, in a general context, some facts obtained in [8] that deal with
the probability-generating functions, but in this paper we focus our attention on the
expected waiting time and show how this general solution can be used to obtain
conditional expected waiting times (Prop. 2.6). We also show on an example how
it can be used to prove, in a simple way, a known result on this topic
(Rem. 2.7).

2 Waiting time on patterns 

Let $(\xi_n)$ be a sequence of i.i.d. random letters from a finite alphabet $\Omega$. For a given
pattern (word) $A = (a_1, a_2, \ldots, a_l) \in \Omega^l$, by $A_{(k)}$ and $A^{(k)}$ we denote the subpatterns
formed by the first and the last $k$ letters of $A$, respectively: $A_{(k)} = (a_1, a_2, \ldots, a_k)$ and
$A^{(k)} = (a_{l-k+1}, a_{l-k+2}, \ldots, a_l)$. For two patterns $A$ and $B$ we write $[A = B] = 1$ if $A = B$
and $[A = B] = 0$ if not. Now we define the correlation polynomial of $A$ and $B$ (of
lengths $l$ and $m$, respectively) as

$$w^{B}_{A}(s) = \sum_{k=1}^{\min\{l,m\}} [A_{(k)} = B^{(k)}]\, \Pr(A^{(l-k)})\, s^{l-k};$$

if $l = \min\{l, m\}$ then for the index $k = l$ in the above sum we assume that $\Pr(A^{(0)}) = 1$.

Example 2.1. Let $\Omega = \{H, T\}$ and let $(\xi_n)$ be a sequence of i.i.d. letters in $\Omega$ with the
distribution

$$\Pr(\xi_n = H) = p \quad\text{and}\quad \Pr(\xi_n = T) = q = 1 - p.$$

Consider two patterns $A = THH$ and $B = THTH$. Then the correlation polynomials
have the following forms:

$$w^{B}_{A}(s) = ps, \quad w^{A}_{B}(s) = 0, \quad w^{A}_{A}(s) = 1 \quad\text{and}\quad w^{B}_{B}(s) = pqs^2 + 1.$$
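The definition can be checked mechanically. The following is a minimal Python sketch, not part of the original paper. The helper names `word_prob` and `correlation_poly` and the dictionary representation of a polynomial (exponent mapped to coefficient) are assumptions made for illustration. The first argument is the pattern whose prefixes are compared with the suffixes of the second argument. For the sample value p = 1/3 the sketch reproduces the four polynomials of Example 2.1.

```python
from fractions import Fraction


def word_prob(word, probs):
    """Probability of a word when letters are i.i.d. with distribution `probs`."""
    result = Fraction(1)
    for letter in word:
        result *= probs[letter]
    return result


def correlation_poly(a, b, probs):
    """Correlation polynomial of (a, b) as a dict {exponent: coefficient}.

    For every k with (first k letters of a) == (last k letters of b) the term
    Pr(last len(a) - k letters of a) * s**(len(a) - k) is present.
    """
    poly = {}
    for k in range(1, min(len(a), len(b)) + 1):
        if a[:k] == b[-k:]:
            # a[k:] is the empty word for k == len(a); its probability is 1.
            poly[len(a) - k] = word_prob(a[k:], probs)
    return poly


p = Fraction(1, 3)                 # sample value, any 0 < p < 1 works
probs = {"H": p, "T": 1 - p}
A, B = "THH", "THTH"

print(correlation_poly(A, B, probs))   # single term p * s
print(correlation_poly(B, A, probs))   # empty dict, i.e. the zero polynomial
print(correlation_poly(A, A, probs))   # constant 1
print(correlation_poly(B, B, probs))   # 1 + p*q * s**2
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the coefficients can be compared with the closed forms without rounding.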

Let us note that the correlation polynomials up to some differences in their definition 
can be used to investigate the appearances of patterns not only in Bernoulli trials but 
also in Markov chains (see for instance p]). 

Now we briefly recall the results presented by Guibas and Odlyzko in [2, sec. 3]. Consider
a set of $m$ patterns (words) $A_i$ ($1 \le i \le m$) of lengths $l_i$, respectively. We assume that
the set of patterns is reduced, that is, none of the patterns contains any other as a
subpattern.

Let $\tau_i$ denote the stopping time until $A_i$ occurs and let $\tau$ be the stopping time until some
of the considered patterns is observed, i.e.

$$\tau = \min\{\tau_i : 1 \le i \le m\}.$$

Let $p_n$ and $p^i_n$ be the probabilities $\Pr(\tau = n)$ and $\Pr(\tau = \tau_i = n)$, respectively. Let $g_\tau(s)$
denote $E(s^\tau) = \sum_{n=0}^{\infty} p_n s^n$, the probability generating function of the random variable $\tau$,
and let $g^i_\tau$ be the generating functions $\sum_{n=0}^{\infty} p^i_n s^n$, $1 \le i \le m$. Since $p_n = \sum_{i=1}^{m} p^i_n$ we
have $g_\tau = \sum_{i=1}^{m} g^i_\tau$. By $Q_\tau$ we denote the generating function of the tail probabilities
of $\tau$, i.e. $Q_\tau(s) = \sum_{n=0}^{\infty} q_n s^n$, where $q_n = \Pr(\tau > n) = \sum_{k>n} p_k$.

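Let $B_n$ be the set of sequences such that no pattern $A_i$ appears in the
string of the first $n$ letters of these sequences. Notice that $\Pr(B_n) = q_n$. If we append
the word $A_i$ to an initial $n$-string from $B_n$, then some pattern $A_j$ is completed for the
first time at one of the positions $n+1, \ldots, n+l_i$, so we must check whether $A_i$ itself or
one of the other patterns appears earlier than at position $n + l_i$. Since $\Pr(B_n) = q_n$ we get
the following system of equations

$$q_n \Pr(A_i) = \sum_{j=1}^{m} \sum_{k=1}^{\min\{l_i,l_j\}} [A_{i(k)} = A_j^{(k)}]\, \Pr(A_i^{(l_i-k)})\, p^j_{n+k} \tag{1}$$

for each $1 \le i \le m$, which is compatible with (3.2) in [2]. Notice now that

$$q_n = q_{n+1} + \sum_{j=1}^{m} p^j_{n+1} \tag{2}$$

(see (3.1) in [2]). Multiplying (1) by $s^{n+l_i}$ and (2) by $s^{n+1}$ and summing from $n = 0$ to
infinity we obtain the following system of linear equations

$$\begin{cases} (1-s)\,Q_\tau(s) + \sum_{j=1}^{m} g^j_\tau(s) = 1, \\[4pt] \Pr(A_i)\, s^{l_i}\, Q_\tau(s) - \sum_{j=1}^{m} w^{A_j}_{A_i}(s)\, g^j_\tau(s) = 0 \qquad (1 \le i \le m). \end{cases} \tag{3}$$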
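The passage from the recurrence for the tail probabilities to the first equation of the system can be written out as follows. This is a short worked step added for the reader, using only the facts that $q_0 = \Pr(\tau > 0) = 1$ and $p^j_0 = 0$ (every pattern has length at least one); the symbols are those introduced above.

```latex
\begin{align*}
\sum_{n=0}^{\infty} q_n s^{n+1} &= s\,Q_\tau(s), \\
\sum_{n=0}^{\infty} q_{n+1} s^{n+1} &= Q_\tau(s) - q_0 = Q_\tau(s) - 1, \\
\sum_{n=0}^{\infty} p^j_{n+1} s^{n+1} &= g^j_\tau(s) - p^j_0 = g^j_\tau(s),
\end{align*}
so that $q_n = q_{n+1} + \sum_{j=1}^{m} p^j_{n+1}$ gives
\[
  s\,Q_\tau(s) = Q_\tau(s) - 1 + \sum_{j=1}^{m} g^j_\tau(s)
  \quad\Longleftrightarrow\quad
  (1-s)\,Q_\tau(s) + \sum_{j=1}^{m} g^j_\tau(s) = 1 .
\]
```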

Remark 2.2. In our opinion the definition of the correlation polynomial (second line
from the top of page 195 in [2]) appearing in the system of linear equations formed in Th. 3.3
[2] should be of the following form

''''' " ^^(^) " ku ^"(^1' •••'^^) " hu ^"(^i) ■ - ■ p^^^y 

Then substituting $z = \frac{1}{s}$ we get the equivalence of the system of linear equations in
Th. 3.3 [2] and the one described in (3).
Theorem 2.3. The solution of the system of linear equations (3) has the following
form

$$g^i_\tau(s) = \frac{\det B^i(s)}{\sum_{j=1}^{m} \det B^j(s) + (1-s)\det B(s)} \qquad (1 \le i \le m) \tag{4}$$

and

$$Q_\tau(s) = \frac{\det B(s)}{\sum_{j=1}^{m} \det B^j(s) + (1-s)\det B(s)}, \tag{5}$$

where $B$ denotes the matrix formed by the correlation polynomials $w^{A_j}_{A_i}$, i.e.

$$B(s) = \left[ w^{A_j}_{A_i}(s) \right]_{1 \le i,j \le m},$$

and $B^j(s)$ is the matrix obtained by replacing the $j$-th column of $B(s)$ by the column vector
$[\Pr(A_i)\, s^{l_i}]_{1 \le i \le m}$.



Proof. Eliminating $Q_\tau(s)$ from the first equation of the system and substituting it
into the remaining ones we obtain an equivalent system of linear equations of the form

$$\Pr(A_i)\, s^{l_i} = \sum_{j=1}^{m} g^j_\tau(s) \left[ \Pr(A_i)\, s^{l_i} + (1-s)\, w^{A_j}_{A_i}(s) \right] \qquad (1 \le i \le m).$$

Let $\mathcal{A}$ denote the coefficient matrix of the above system, i.e.

$$\mathcal{A}(s) = \left[ \Pr(A_i)\, s^{l_i} + (1-s)\, w^{A_j}_{A_i}(s) \right]_{1 \le i,j \le m}.$$

Notice that because $w^{A_i}_{A_i}(0) = 1$ and $w^{A_j}_{A_i}(0) = 0$ for $i \neq j$, $\mathcal{A}(0)$ is the identity
matrix. Since $\det \mathcal{A}(0) = 1$ and $\det \mathcal{A}(s)$ is a polynomial, $\det \mathcal{A}(s) \neq 0$ on some
neighborhood of zero. It means that on this neighborhood there exists a solution of the
system.

Because the determinant of $m \times m$ matrices is an $m$-linear functional with respect
to columns (equivalently, to rows), one can check that

$$\det \mathcal{A}(s) = (1-s)^m \det B(s) + (1-s)^{m-1} \sum_{j=1}^{m} \det B^j(s).$$

If now, similarly, $\mathcal{A}^j(s)$ denotes the matrix formed by replacing the $j$-th column of
$\mathcal{A}(s)$ by the column vector $[\Pr(A_i)\, s^{l_i}]_{1 \le i \le m}$, then the calculus of determinants gives
$\det \mathcal{A}^j(s) = (1-s)^{m-1} \det B^j(s)$. By Cramer's rule we obtain

$$g^i_\tau(s) = \frac{\det \mathcal{A}^i(s)}{\det \mathcal{A}(s)} = \frac{\det B^i(s)}{\sum_{j=1}^{m} \det B^j(s) + (1-s)\det B(s)}$$

for $1 \le i \le m$.

Substituting the functions $g^i_\tau$ into the first equation of the system one can
calculate that

$$Q_\tau(s) = \frac{\det B(s)}{\sum_{j=1}^{m} \det B^j(s) + (1-s)\det B(s)}.$$

□

Notice that the probability-generating function $g^i_\tau(s) = \sum_{n=0}^{\infty} p^i_n s^n$ is certainly well defined
on the interval $[-1, 1]$ (it is an analytic function on $(-1, 1)$). The right-hand
side of (4) is a rational function equal to $g^i_\tau$ on some neighborhood of zero. By
analytic continuation, the limit of the right-hand side of (4) as $s \to 1^-$ exists
and is equal to $g^i_\tau(1)$. Thus we obtain the following

Corollary 2.4. The probability $\Pr(\tau = \tau_i)$ that the pattern $A_i$ precedes all the remaining
$m - 1$ patterns is equal to $g^i_\tau(1)$, that is

$$\Pr(\tau = \tau_i) = \frac{\det B^i(1)}{\sum_{j=1}^{m} \det B^j(1)}, \tag{6}$$

where the right-hand side of the above equality is understood as the limit of (4) as
$s \to 1^-$.

An application of the above to a generalization of Conway's formula can be
found in [8].

Since $Q_\tau(1)$ is the expected value of $\tau$, we can formulate another

Corollary 2.5. The expected waiting time until one of the $m$ patterns is observed is given
by

$$E\tau = \frac{\det B(1)}{\sum_{j=1}^{m} \det B^j(1)}, \tag{7}$$

where the right-hand side is the limit of (5) as $s \to 1^-$.

Note that the above formula is also true for one pattern $A$ of length $l$ ($m = 1$), since then

$$E\tau = \frac{w^{A}_{A}(1)}{\Pr(A)} = \sum_{k=1}^{l} \frac{[A_{(k)} = A^{(k)}]}{\Pr(A_{(k)})} \tag{8}$$

(compare Solov'ev's result in [7]).
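The two corollaries reduce to a handful of determinants evaluated at s = 1, which is easy to try out numerically. The following Python sketch is an illustration only. The function names (`corr_at_one`, `det`, `first_occurrence`) are hypothetical, the determinant is computed by a plain Laplace expansion that is adequate only for a small number of patterns, and it is assumed that the set of patterns is reduced and that the sum of the determinants in the denominator does not vanish at s = 1.

```python
from fractions import Fraction


def word_prob(word, probs):
    """Probability of a word when letters are i.i.d. with distribution `probs`."""
    result = Fraction(1)
    for letter in word:
        result *= probs[letter]
    return result


def corr_at_one(a, b, probs):
    """Correlation polynomial at s = 1: prefixes of `a` against suffixes of `b`."""
    return sum(
        (word_prob(a[k:], probs)
         for k in range(1, min(len(a), len(b)) + 1)
         if a[:k] == b[-k:]),
        Fraction(0),
    )


def det(matrix):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if not matrix:
        return Fraction(1)
    total = Fraction(0)
    for j, entry in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def first_occurrence(patterns, probs):
    """Return ([Pr(tau = tau_i) for each i], E tau) from determinants at s = 1."""
    m = len(patterns)
    B = [[corr_at_one(patterns[i], patterns[j], probs) for j in range(m)]
         for i in range(m)]
    column = [word_prob(a, probs) for a in patterns]   # Pr(A_i) * 1**l_i
    dets = []
    for j in range(m):
        Bj = [row[:j] + [column[i]] + row[j + 1:] for i, row in enumerate(B)]
        dets.append(det(Bj))
    total = sum(dets, Fraction(0))
    return [d / total for d in dets], det(B) / total


coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
win, e_tau = first_occurrence(["THH", "HTH", "HHT"], coin)
print(win, e_tau)                        # winning probabilities and E tau
print(first_occurrence(["HH"], coin))    # single pattern: E tau = 6 for a fair coin
```

For a single pattern the same routine returns the classical value (6 tosses on average for HH with a fair coin), which agrees with the one-pattern formula above.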

Now we can proceed to calculate the expected waiting time for the pattern $A_i$
knowing that it appears first, namely the conditional expectation of the random
variable $\tau$ given the event $\{\tau = \tau_i\}$; that is, we prove the following

Proposition 2.6. For $1 \le i \le m$,

$$E(\tau \mid \tau = \tau_i) = E\tau + \frac{1}{\Pr(\tau = \tau_i)}\, \frac{d}{ds}\left( \frac{\det B^i}{\sum_{j=1}^{m} \det B^j} \right)(1). \tag{9}$$

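Proof. Observe that the conditional expectation $E(\tau \mid \tau = \tau_i)$ can be expressed in terms
of the function $g^i_\tau$ as follows

$$E(\tau \mid \tau = \tau_i) = \sum_{n=1}^{\infty} n \Pr(\tau = n \mid \tau = \tau_i) = \frac{1}{\Pr(\tau = \tau_i)} \sum_{n=1}^{\infty} n \Pr(\tau = \tau_i = n) = \frac{1}{\Pr(\tau = \tau_i)}\, \frac{d g^i_\tau}{ds}(1). \tag{10}$$

Differentiating the function $g^i_\tau$ given by formula (4) and taking the value at 1 we get

$$\frac{d g^i_\tau}{ds}(1) = \frac{\frac{d}{ds}\det B^i(1) \sum_{j=1}^{m} \det B^j(1) - \det B^i(1) \sum_{j=1}^{m} \frac{d}{ds}\det B^j(1)}{\left[ \sum_{j=1}^{m} \det B^j(1) \right]^2} + \frac{\det B^i(1)}{\sum_{j=1}^{m} \det B^j(1)} \cdot \frac{\det B(1)}{\sum_{j=1}^{m} \det B^j(1)}.$$

Notice that the first summand can be expressed as $\frac{d}{ds}\left( \frac{\det B^i}{\sum_{j=1}^{m} \det B^j} \right)(1)$ and the second
one, by Corollaries 2.4 and 2.5, is equal to $\Pr(\tau = \tau_i)\, E\tau$. Thus, by (10), we get the
formula (9) for $E(\tau \mid \tau = \tau_i)$. □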
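The conditional expectation can also be evaluated exactly by computer. Writing $N_i(s) = \det B^i(s)$ and $D(s) = \sum_j N_j(s) + (1-s)\det B(s)$, the identity $E(\tau \mid \tau = \tau_i) = (g^i_\tau)'(1)/g^i_\tau(1)$ becomes $N_i'(1)/N_i(1) - D'(1)/D(1)$. The sketch below is one possible implementation of this, under the assumption that polynomials are stored as coefficient lists (index = exponent of s); all function names are hypothetical and the Laplace-expansion determinant is meant for small sets of patterns only.

```python
from fractions import Fraction


def word_prob(word, probs):
    result = Fraction(1)
    for letter in word:
        result *= probs[letter]
    return result


def corr_poly(a, b, probs):
    """Coefficient list (index = exponent of s) of the correlation polynomial."""
    coeffs = [Fraction(0)] * len(a)
    for k in range(1, min(len(a), len(b)) + 1):
        if a[:k] == b[-k:]:
            coeffs[len(a) - k] += word_prob(a[k:], probs)
    return coeffs


def pmul(f, g):
    out = [Fraction(0)] * (len(f) + len(g) - 1)
    for i, x in enumerate(f):
        for j, y in enumerate(g):
            out[i + j] += x * y
    return out


def padd(f, g, sign=1):
    n = max(len(f), len(g))
    f = f + [Fraction(0)] * (n - len(f))
    g = g + [Fraction(0)] * (n - len(g))
    return [x + sign * y for x, y in zip(f, g)]


def pdet(matrix):
    """Determinant of a matrix of polynomials (Laplace expansion, small m only)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = [Fraction(0)]
    for j, entry in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total = padd(total, pmul(entry, pdet(minor)), (-1) ** j)
    return total


def at_one(f):
    return sum(f, Fraction(0))


def slope_at_one(f):
    return sum((k * c for k, c in enumerate(f)), Fraction(0))


def conditional_waits(patterns, probs):
    """E(tau | tau = tau_i) for each i, from values and derivatives at s = 1."""
    m = len(patterns)
    B = [[corr_poly(patterns[i], patterns[j], probs) for j in range(m)]
         for i in range(m)]
    # Column vector with entries Pr(A_i) * s**l_i.
    col = [[Fraction(0)] * len(a) + [word_prob(a, probs)] for a in patterns]
    N = [pdet([row[:j] + [col[i]] + row[j + 1:] for i, row in enumerate(B)])
         for j in range(m)]
    d_value = sum((at_one(n) for n in N), Fraction(0))
    # Derivative at 1 of sum_j det B^j(s) + (1 - s) det B(s).
    d_slope = sum((slope_at_one(n) for n in N), Fraction(0)) - at_one(pdet(B))
    return [slope_at_one(n) / at_one(n) - d_slope / d_value for n in N]


coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
print(conditional_waits(["THH", "HTH", "HHT"], coin))
```

For the patterns THH, HTH, HHT and a fair coin this yields 86/15, 16/3 and 4, the values quoted for the symmetric case of Example 2.10.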

Before we present some examples we would like to show an application of our results
that recovers a known fact; it is contained in

Remark 2.7. Define a number $A_j * A_i$ as

$$A_j * A_i = \frac{w^{A_j}_{A_i}(1)}{\Pr(A_i)} = \sum_{k=1}^{\min\{l_i,l_j\}} \frac{[A_{i(k)} = A_j^{(k)}]}{\Pr(A_{i(k)})}.$$

Let us emphasize that the above coincides with the notation (2.3) in ^y. Consider
now the matrix

$$C = [A_j * A_i]_{1 \le i,j \le m}.$$

Observe that $\det B(1) = \prod_{i=1}^{m} \Pr(A_i) \det C$ and $\det B^j(1) = \prod_{i=1}^{m} \Pr(A_i) \det C^j$, where $C^j$
is the matrix formed by replacing the $j$-th column of $C$ by the column vector
$[1]_{1 \le i \le m}$. Hence we can rewrite Corollaries 2.4 and 2.5 in terms of the matrix $C$
as follows:

$$\Pr(\tau = \tau_i) = \frac{\det C^i}{\sum_{j=1}^{m} \det C^j} \quad\text{and}\quad E\tau = \frac{\det C}{\sum_{j=1}^{m} \det C^j}.$$

By martingale arguments Li proved in [1] (see Theorem 3.1) that for every
$i = 1, 2, \ldots, m$ we have

$$E\tau = \sum_{j=1}^{m} \Pr(\tau = \tau_j)\,(A_j * A_i),$$

which in our notation is equivalent to the following formula

$$\det C = \sum_{j=1}^{m} (A_j * A_i) \det C^j \qquad (1 \le i \le m).$$

Now we independently prove the above determinant identity.

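Proof. Let $\mathcal{D}$ be the matrix $C$ extended by an initial row and column as follows

$$\mathcal{D} = \left( \begin{array}{c|ccc} a_{00} & a_{01} & \cdots & a_{0m} \\ \hline a_{10} & & & \\ \vdots & & C & \\ a_{m0} & & & \end{array} \right),$$

where $a_{00} = 1$, $(a_{01}, \ldots, a_{0m})$ is the $i$-th row of the matrix $C$, i.e. $a_{0j} = A_j * A_i$,
$1 \le j \le m$, and $a_{k0} = 1$ for $k \neq i$ and $a_{i0} = 0$. The Laplace expansion along the zero
column yields

$$\det \mathcal{D} = \det C,$$

because for $k \neq 0, i$ the minor obtained by removing the $k$-th row and the zero column has two
equal rows. Taking the Laplace expansion along the $i$-th row we obtain

$$\det \mathcal{D} = \sum_{j=1}^{m} (A_j * A_i)\,(-1)^{i+j} \det \mathcal{D}_{ij},$$

where $\mathcal{D}_{ij}$ denotes $\mathcal{D}$ with the $i$-th row and the $j$-th column removed.
Notice that for every matrix $\mathcal{D}_{ij}$, moving the zero row and the zero column into the places
of the removed ones, by the properties of determinants, we get

$$\det \mathcal{D}_{ij} = (-1)^{i+j-2} \det C^j.$$

Thus

$$\det \mathcal{D} = \sum_{j=1}^{m} (A_j * A_i) \det C^j$$

and this completes the proof.

□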
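The determinant identity is easy to test on a concrete set of patterns. The sketch below is illustrative only: the names `star`, `det` and `check_identity` are hypothetical, `star(a_j, a_i, probs)` implements the number $A_j * A_i$ as the sum of $1/\Pr(A_{i(k)})$ over those $k$ for which the first $k$ letters of $A_i$ equal the last $k$ letters of $A_j$, and exact rational arithmetic is used so that the comparison involves no rounding.

```python
from fractions import Fraction


def word_prob(word, probs):
    result = Fraction(1)
    for letter in word:
        result *= probs[letter]
    return result


def star(a_j, a_i, probs):
    """The number A_j * A_i built from overlaps of prefixes of A_i with suffixes of A_j."""
    return sum(
        (1 / word_prob(a_i[:k], probs)
         for k in range(1, min(len(a_i), len(a_j)) + 1)
         if a_i[:k] == a_j[-k:]),
        Fraction(0),
    )


def det(matrix):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if not matrix:
        return Fraction(1)
    total = Fraction(0)
    for j, entry in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def check_identity(patterns, probs):
    """Return det C and, for each row i, sum_j (A_j * A_i) det C^j."""
    m = len(patterns)
    C = [[star(patterns[j], patterns[i], probs) for j in range(m)]
         for i in range(m)]
    # C^j: the j-th column of C replaced by a column of ones.
    det_Cj = [det([row[:j] + [Fraction(1)] + row[j + 1:] for row in C])
              for j in range(m)]
    row_sums = [sum(C[i][j] * det_Cj[j] for j in range(m)) for i in range(m)]
    return det(C), row_sums


coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
det_C, row_sums = check_identity(["THH", "HTH", "HHT"], coin)
print(det_C, row_sums)   # every row sum equals det C
```

For THH, HTH, HHT with a fair coin the matrix $C$ has rows (8, 4, 2), (2, 10, 4), (6, 2, 8), and each of the three row sums coincides with $\det C = 496$.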
Example 2.8. Let $\Omega = \{A, C, G, T\}$ and let $(\xi_n)$ be a sequence of i.i.d. letters in $\Omega$ with
the probabilities

$$\Pr(\xi_n = A) = p_a, \quad \Pr(\xi_n = C) = p_c, \quad \Pr(\xi_n = G) = p_g \quad\text{and}\quad \Pr(\xi_n = T) = p_t.$$

Consider the set of three patterns: $A_1 = ACG$, $A_2 = ATG$ and $A_3 = AG$. Observe that
$w^{A_j}_{A_i}(s) = 0$ if $i \neq j$ and $w^{A_i}_{A_i}(s) = 1$. So in this case the matrix $B(s) = [w^{A_j}_{A_i}(s)]_{1 \le i,j \le 3}$
is the identity matrix. For the matrices $B^i(s)$ we obtain

$$\det B^i(s) = \Pr(A_i)\, s^{l_i}.$$

Since $\Pr(\tau = \tau_i) = \frac{\det B^i(1)}{\sum_{j=1}^{3} \det B^j(1)}$, the probabilities that the $i$-th pattern occurs first
are given by

$$\Pr(\tau = \tau_1) = \frac{p_c}{p_c + p_t + 1}, \quad \Pr(\tau = \tau_2) = \frac{p_t}{p_c + p_t + 1}, \quad \Pr(\tau = \tau_3) = \frac{1}{p_c + p_t + 1}.$$

Applying now Corollary 2.5 we obtain

$$E\tau = \frac{1}{p_a p_g (p_c + p_t + 1)}.$$

By Proposition 2.6 we can calculate

$$E(\tau \mid \tau = \tau_1) = \frac{p_a p_g + 1}{p_a p_g (p_c + p_t + 1)},$$
$$E(\tau \mid \tau = \tau_2) = \frac{p_a p_g + 1}{p_a p_g (p_c + p_t + 1)},$$
$$E(\tau \mid \tau = \tau_3) = \frac{1 - p_a p_g (p_c + p_t)}{p_a p_g (p_c + p_t + 1)}.$$

Note that for each pattern, by Solov'ev's formula (8), we have

$$E\tau_1 = \frac{1}{p_a p_c p_g}, \quad E\tau_2 = \frac{1}{p_a p_g p_t}, \quad E\tau_3 = \frac{1}{p_a p_g}.$$

Observe that for different $p_a, p_c, p_g, p_t$ the values $E\tau_1, E\tau_2, E\tau_3$ are in general different.
But in this example we obtained that $E(\tau \mid \tau = \tau_1) = E(\tau \mid \tau = \tau_2)$ for any $p_a, p_c, p_g, p_t$.

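Remark 2.9. If $B(s)$ is the identity matrix then applying Proposition 2.6 one can
calculate that

$$E(\tau \mid \tau = \tau_i) = E\tau + l_i - \frac{\sum_{k=1}^{m} l_k \Pr(A_k)}{\sum_{k=1}^{m} \Pr(A_k)}.$$

It means that if additionally the patterns $A_i$ and $A_j$ are of the same length then
$E(\tau \mid \tau = \tau_i) = E(\tau \mid \tau = \tau_j)$. For this reason in the above example
$E(\tau \mid \tau = \tau_1) = E(\tau \mid \tau = \tau_2)$.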
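When $B(s)$ is the identity matrix, $\det B^i(s) = \Pr(A_i)\,s^{l_i}$, so $E\tau = 1/\sum_k \Pr(A_k)$ and the conditional expectation is $E\tau$ plus the length $l_i$ minus the $\Pr(A_k)$-weighted mean length of the patterns. The short Python sketch below illustrates this special case; the function name `conditional_waits_identity` is hypothetical, and the code assumes without checking that the patterns have no nontrivial overlaps (so that $B(s)$ really is the identity).

```python
from fractions import Fraction


def word_prob(word, probs):
    result = Fraction(1)
    for letter in word:
        result *= probs[letter]
    return result


def conditional_waits_identity(patterns, probs):
    """Return (E tau, [E(tau | tau = tau_i)]) ASSUMING B(s) is the identity matrix."""
    pr = [word_prob(a, probs) for a in patterns]
    total = sum(pr, Fraction(0))
    # Mean pattern length weighted by the pattern probabilities.
    mean_len = sum((len(a) * x for a, x in zip(patterns, pr)), Fraction(0)) / total
    e_tau = 1 / total
    return e_tau, [e_tau + len(a) - mean_len for a in patterns]


dna = {letter: Fraction(1, 4) for letter in "ACGT"}
e_tau, waits = conditional_waits_identity(["ACG", "ATG", "AG"], dna)
print(e_tau, waits)
```

With equiprobable letters and the patterns ACG, ATG, AG of Example 2.8 the two patterns of length three get the same conditional expectation, as the remark predicts.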
Example 2.10. Let now $\Omega = \{H, T\}$ and $\Pr(\xi_n = H) = p$ and $\Pr(\xi_n = T) = q = 1 - p$.
Consider three patterns: $A_1 = THH$, $A_2 = HTH$ and $A_3 = HHT$. In this case

$$B(s) = [w^{A_j}_{A_i}(s)]_{1 \le i,j \le 3} = \begin{pmatrix} 1 & ps & p^2s^2 \\ pqs^2 & pqs^2 + 1 & ps \\ pqs^2 + qs & pqs^2 & 1 \end{pmatrix}$$

and

$$\det B(s) = -p^3q^2s^5 - 2p^2qs^3 + pqs^2 + 1,$$
$$\det B^1(s) = p^2qs^3\left(-p^2qs^3 + pqs^2 - ps + 1\right),$$
$$\det B^2(s) = p^2qs^3\left(1 - ps\right),$$
$$\det B^3(s) = p^2qs^3\left(-pq^2s^3 - qs + 1\right).$$

By Corollary 2.4 we get

$$\Pr(\tau = \tau_1) = \frac{1 - p + pq - p^2q}{1 + q}, \quad \Pr(\tau = \tau_2) = \frac{1 - p}{1 + q}, \quad \Pr(\tau = \tau_3) = \frac{1 - q - pq^2}{1 + q}.$$

Taking into account that $q = 1 - p$, by virtue of Corollary 2.5 we obtain the formula
for the expected waiting time $\tau$:

$$E\tau = \frac{-p^5 + 2p^4 + p^3 - 3p^2 + p + 1}{p^2(1 - p)(2 - p)}.$$

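By Proposition 2.6 one can calculate that

$$E(\tau \mid \tau = \tau_1) = E\tau + \frac{-p^5 + p^4 + 8p^3 - 14p^2 + 4p + 1}{(1 - p)(1 + p - p^2)(2 - p)},$$
$$E(\tau \mid \tau = \tau_2) = E\tau + \frac{p^3 - 2p^2 - p + 1}{(1 - p)(2 - p)},$$
$$E(\tau \mid \tau = \tau_3) = E\tau + \frac{-p^4 - p^3 + 7p^2 - 2p - 1}{p^2(2 - p)}.$$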

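In the symmetric case $p = q = \frac{1}{2}$ the above quantities take the values

$$\Pr(\tau = \tau_1) = \frac{5}{12}, \quad \Pr(\tau = \tau_2) = \frac{1}{3}, \quad \Pr(\tau = \tau_3) = \frac{1}{4}, \quad E\tau = \frac{31}{6}$$

and

$$E(\tau \mid \tau = \tau_1) = \frac{86}{15}, \quad E(\tau \mid \tau = \tau_2) = \frac{16}{3}, \quad E(\tau \mid \tau = \tau_3) = 4.$$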